feat: spec-compliant hashing + 36 golden parity tests#2

Merged
eywalker merged 4 commits into main from
eywalker/plt-1043-prepare-nauticalabstarfix-python-to-reach-logical-hashing
Mar 20, 2026

Conversation


@agent-kurouto agent-kurouto bot commented Mar 20, 2026

Summary

  • Rewrites arrow_digester.py to implement the byte-layout spec's BTreeMap decomposition for hash_array and the record-batch path:
    • Lists decompose into separate validity-only, structural, and data entries (spec Section 3.4)
    • hash_array now uses the same recursive decomposition as hash_record_batch (spec Section 6)
    • Struct null AND-propagation follows spec Section 3.5
  • Adds tests/test_golden_parity.py with 45 tests (36 unique golden hashes) verified against the Rust starfix crate, covering:
    • Spec examples A, B, C, D, E, F, G, H, I, J, K, N, O
    • All primitive types (bool, int8–uint64, float32, float64)
    • Temporal types (date32, date64, time32 s/ms, time64 us/ns)
    • Variable-length (binary, string) with type canonicalization
    • Decimal128 at three precision ranges (1–9, 10–18, 19–38)
    • List, List/LargeList equivalence
    • Struct column, nested struct (two levels), struct with nullable children
    • Multi-type record batch (24 columns), schema-only hash
    • Streaming two-batch incremental update, batch-split independence
  • Fixes list hash_array bug: Previously used a single combined entry instead of the spec's decomposed entries, producing incorrect hashes for any list-typed array
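
The three-entry list decomposition described above can be sketched roughly as follows. The entry paths, `#` suffixes, and byte widths here are illustrative assumptions, not the actual arrow_digester.py internals:

```python
# Hypothetical sketch of the spec 3.4 list decomposition: one validity-only
# entry, one structural (offsets) entry, and one data entry for the
# flattened child values. Names and layouts are illustrative, not the real API.
def decompose_list(path, validity, offsets, child_values):
    entries = {}
    if validity is not None:
        entries[f"{path}#validity"] = bytes(validity)
    entries[f"{path}#offsets"] = b"".join(
        o.to_bytes(4, "little") for o in offsets
    )
    entries[f"{path}#data"] = b"".join(
        v.to_bytes(4, "little", signed=True) for v in child_values
    )
    return entries  # iterated in sorted key order, analogous to a BTreeMap

# [[10, 20], [30]] expressed as offsets plus flat child values:
entries = decompose_list("col", [1, 1], [0, 2, 3], [10, 20, 30])
print(sorted(entries))  # ['col#data', 'col#offsets', 'col#validity']
```

The old implementation folded these into a single combined entry, which is why every list-typed hash changed.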

Test plan

  • All 40 existing tests pass (0 regressions)
  • All 45 new golden parity tests pass
  • Golden values generated independently by cargo run --bin emit_golden in Rust starfix (post-PR #11)

Resolves PLT-1043

🤖 Generated with Claude Code

…den parity tests

Rewrites the core hashing algorithm to match the byte-layout spec:
- hash_array now uses the same recursive BTreeMap decomposition as the
  record-batch path (spec Section 6), fixing list and struct hashing
- Lists decompose into separate validity-only, structural, and data
  entries (spec Section 3.4) instead of a single combined entry
- Recursive traversal properly AND-propagates struct nulls to descendants
  (spec Section 3.5)
- Adds 45 golden-hash parity tests covering all primitive types, temporal
  types, variable-length types, decimals, lists, structs, nested structs,
  multi-type batches, and streaming — all verified against Rust starfix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates the Python ArrowDigester implementation to follow the byte-layout spec’s recursive BTreeMap-style decomposition for hashing Arrow arrays and record batches, and adds a comprehensive golden-parity test suite to ensure byte-for-byte hash equality with the Rust implementation.

Changes:

  • Reworked ArrowDigester hashing to use a recursive traversal + decomposed digest-entry model for lists/structs (shared between hash_array and record-batch hashing).
  • Added extensive golden parity tests covering many Arrow types and spec examples, with expected hashes generated from the Rust crate.
  • Updated the existing list-array golden hash to reflect the corrected spec-compliant behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File: Description
src/starfix/arrow_digester.py: Major rewrite of hashing logic and decomposition/traversal implementation to match the spec.
tests/test_golden_parity.py: New golden parity regression suite (Rust-authoritative expected hashes).
tests/test_arrow_digester.py: Updates an existing list hash expectation to the new correct value.
.python-version: Pins local Python version to 3.12.

"""Pure-Python implementation of the starfix Arrow logical hasher.

Produces identical hashes to the Rust implementation for all supported types.
Implements the byte-layout specification from ``docs/byte-layout-spec.md``.

Copilot AI Mar 20, 2026

The module docstring references docs/byte-layout-spec.md, but there is no docs/ directory (and no file with that name) in this repository. This makes the reference non-actionable for maintainers; please either add the spec file at that path or update the reference to the correct location (e.g., a README section or an in-repo spec file).

Suggested change
Implements the byte-layout specification from ``docs/byte-layout-spec.md``.
Implements the Arrow byte-layout specification used by starfix.

Contributor Author

Good catch — the spec lives in the Rust repo, not here. Fixed the docstring to reference nauticalab/starfix docs/byte-layout-spec.md instead of a non-existent local path.

children.sort(key=lambda f: f.name)
for child in children:
child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
_extract_type_entries(child.type, child.nullable, child_path, out)

Copilot AI Mar 20, 2026

Struct null propagation seems incomplete at the type-decomposition stage: when a struct field is nullable, all descendant entries become effectively nullable (struct validity AND child validity per spec 3.5). Here _extract_type_entries() recurses into children using only child.nullable, so non-nullable children of a nullable struct will not get a null_bits buffer and their propagated null positions won’t be represented in the final digest (only skipped in data hashing). Consider recursing with child.nullable or nullable so child entries include null_bits whenever any ancestor struct is nullable.

Suggested change
_extract_type_entries(child.type, child.nullable, child_path, out)
_extract_type_entries(child.type, child.nullable or nullable, child_path, out)

Contributor Author

This is intentional and matches both the spec and the Rust implementation.

The spec (Section 3.5, Example M) says struct-level nulls are AND-propagated at traversal time, not at entry-creation time. Non-nullable children of a nullable struct get data-only entries (no null_bits). The null propagation is handled by _traverse_struct which passes combined ancestor nulls down — _traverse_leaf then skips null rows when hashing data bytes.

From spec Example M:

Entry "a" (Int32, non-nullable → data-only): Struct nulls propagated: rows 0, 2 valid → data: [10, 30]

The Rust extract_type_entries also passes child_field.is_nullable() (not child.nullable || parent.nullable) — see arrow_digester_core.rs:934. All 126 parity tests (including nullable struct cases like r2_nullable_struct_batch and r2_nested_nullable_inner_struct) confirm this produces correct hashes.
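
The traversal-time AND-propagation this reply describes can be sketched minimally as follows; the function name is hypothetical, and the real logic lives in `_traverse_struct`/`_traverse_leaf`:

```python
# Hypothetical sketch of spec 3.5 traversal-time null propagation: a child
# row is effectively valid only where both the struct slot and the child
# slot are valid. Non-nullable children carry no validity bitmap of their own.
def combine_validity(parent_valid, child_valid):
    if parent_valid is None:
        return child_valid
    if child_valid is None:
        return parent_valid
    return [p and c for p, c in zip(parent_valid, child_valid)]

# Example M-style data: struct rows 0 and 2 valid, row 1 null; child "a"
# (Int32) is declared non-nullable, so it contributes no validity of its own.
struct_valid = [True, False, True]
values = [10, 20, 30]

effective = combine_validity(struct_valid, None)
data = [v for v, ok in zip(values, effective) if ok]
print(data)  # [10, 30], matching the spec's Example M data entry
```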

Comment on lines +314 to +315
inner = _normalize_field(dt.key_field)
return pa.map_(inner.type, dt.item_field.type)

Copilot AI Mar 20, 2026

_normalize_data_type()’s Map handling drops important information and may produce an incorrect canonical type: it only normalizes the key field (not the item/value field), it doesn’t preserve keys_sorted, and it reconstructs the map using pa.map_() defaults. If maps are intended to be supported, this should normalize both key/item fields and carry over keys_sorted to avoid schema/type hashes diverging from Arrow/Rust for map schemas.

Suggested change
inner = _normalize_field(dt.key_field)
return pa.map_(inner.type, dt.item_field.type)
key_field = _normalize_field(dt.key_field)
item_field = _normalize_field(dt.item_field)
return pa.map_(key_field.type, item_field.type, keys_sorted=dt.keys_sorted)

Contributor Author

Good catch — the item field was not being normalized. Fixed to normalize both key and item fields, and preserve keys_sorted. Map types are not yet exercised by the test suite, but this is the correct normalization per the spec.

…sorting

Adds test_golden_parity_r2.py with 41 additional golden-hash tests covering:
- Empty and all-null arrays
- Multi-byte bitvec (17-element boolean)
- Single-element edge cases
- Nullable struct via hash_array (struct-level nulls)
- Three-level nested structs (batch, schema, hash_array)
- Struct with list children
- List<Struct> and nullable List<Struct> via hash_array
- List<List<Int32>> nested lists (hash_array, batch, schema)
- Mixed struct+list+scalar record batches
- Unicode strings and varied-size binary
- Float64 special values (NaN, ±inf, ±0.0)
- Multi-batch streaming with struct and list columns
- Complex schema-only hashes (12 fields, 4-level nesting)
- Realistic geospatial-style record batch

Also fixes a bug where hash_array did not sort JSON keys in the type
metadata for struct types, causing struct hash_array to diverge from
Rust. The spec requires all JSON keys sorted alphabetically at every
nesting level.
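
Python's json module handles this sorting requirement directly; a hedged illustration (the metadata key names below are made up, not the spec's actual schema):

```python
import json

# Illustrative nested type metadata: sort_keys=True sorts keys at every
# nesting level, including dicts inside lists, which is what the spec's
# alphabetical-ordering rule requires.
type_meta = {"name": "struct", "children": [{"type": "int32", "name": "a"}]}
canonical = json.dumps(type_meta, sort_keys=True, separators=(",", ":"))
print(canonical)
# {"children":[{"name":"a","type":"int32"}],"name":"struct"}
```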

NOTE: One Rust bug discovered and filed as PLT-1047 — Rust panics on
hash_array with List<nullable leaf> when inner elements contain nulls
(sliced array null_count mismatch in traverse_leaf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


Comment on lines +248 to +256
if pa.types.is_struct(canonical):
# Struct is transparent — no entry for the struct itself.
# Recurse into children sorted alphabetically (spec Section 3.5).
children = [canonical.field(i) for i in range(canonical.num_fields)]
children.sort(key=lambda f: f.name)
for child in children:
child_path = f"{path}{DELIMITER}{child.name}" if path else child.name
_extract_type_entries(child.type, child.nullable, child_path, out)


Copilot AI Mar 20, 2026

_extract_type_entries() treats structs as “transparent” and only uses each child field’s declared nullable to decide whether to allocate null_bits. If the parent struct itself is nullable, nulls must still be AND-propagated to descendants (spec 3.5), which requires descendant entries to include validity (null_bits) even when the child field is declared non-nullable. As written, nullable structs with non-nullable children will lose null position information (children have null_bits=None). Consider propagating nullable into the recursive calls (e.g., pass nullable or child.nullable) so descendants can record effective validity.

Contributor Author

Same point as the earlier review — addressed in the reply there. This is intentional per the spec: struct null propagation happens at traversal time (via _traverse_struct passing combined ancestor nulls), not at entry-creation time. Non-nullable children of a nullable struct get data-only entries. See spec Example M and Rust extract_type_entries at line 934. Verified by 126 parity tests.

from decimal import Decimal

import pyarrow as pa
import pytest

Copilot AI Mar 20, 2026

pytest is imported but not used in this test module. Removing unused imports helps keep the test code clean and avoids lint warnings in environments that enforce them.

Suggested change
import pytest

Contributor Author

This was already removed in a subsequent commit — neither test_golden_parity.py nor test_golden_parity_r2.py imports pytest in the current version.

from __future__ import annotations

import pyarrow as pa
import pytest

Copilot AI Mar 20, 2026

pytest is imported but not used in this test module. Consider removing it to avoid unused-import warnings in linted environments.

Suggested change
import pytest

Contributor Author

Same as above — pytest is not imported in the current version of this file.

- Fix docstring to reference spec in the Rust repo (not local path)
- Normalize both key and item fields in Map type, preserve keys_sorted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

agent-kurouto bot commented Mar 20, 2026

Response to Copilot review

All three comments addressed in commit 4bde6a1:

  1. Docstring reference — Fixed. Now references nauticalab/starfix docs/byte-layout-spec.md instead of a non-existent local path.

  2. Struct null propagation in _extract_type_entries — Intentional, no change. The spec (Section 3.5, Example M) propagates struct nulls at traversal time via _traverse_struct, not at entry-creation time. Non-nullable children of a nullable struct get data-only entries (no null_bits). This matches the Rust implementation (arrow_digester_core.rs:934) and is verified by 126 parity tests including r2_nullable_struct_batch and r2_nested_nullable_inner_struct.

  3. Map normalization — Fixed. Now normalizes both key and item fields, and preserves keys_sorted. Map types aren't exercised by tests yet, but the fix is correct per the spec.

Runs pytest across Python 3.10, 3.11, 3.12 using uv on push to main
and on pull requests. Covers all 126 golden parity + unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eywalker eywalker merged commit b2fd8ca into main Mar 20, 2026
3 checks passed
@eywalker eywalker deleted the eywalker/plt-1043-prepare-nauticalabstarfix-python-to-reach-logical-hashing branch March 20, 2026 07:50